Correlation-based Attribute Selection using Genetic Algorithm
Authors
Abstract
A Data Warehouse (DW) is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. DWs are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. Integration of data sources refers to the task of developing a common schema, as well as data transformation solutions, for a number of data sources with related content. The large number and size of modern data sources make this process cumbersome. In such cases, attribute subset selection is performed on the basis of relevance analysis, in the form of correlation analysis, to detect attributes that contribute little to the characteristics of the data as a whole. A redundant attribute, or one strongly correlated with some other attribute, is then excluded from the DW. Automated tools based on existing methods for attribute subset selection may not always yield an optimal set of attributes, which may degrade the performance of the DW. This paper formulates and validates a method for selecting an optimal attribute subset based on correlation using Genetic
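The idea of dropping strongly correlated attributes with a genetic search can be illustrated with a minimal sketch. This is not the paper's method: the fitness function (number of attributes kept, minus a penalty for every strongly correlated pair kept together), the 0.9 threshold, the GA settings, and all function names below are assumptions chosen only for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear relationship
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Absolute pairwise correlations between attribute columns."""
    k = len(columns)
    return [[abs(pearson(columns[i], columns[j])) for j in range(k)]
            for i in range(k)]


def fitness(mask, corr, threshold=0.9):
    """Assumed fitness: attributes kept minus 2 per redundant pair kept."""
    kept = [i for i, bit in enumerate(mask) if bit]
    redundant = sum(1 for a in range(len(kept)) for b in range(a + 1, len(kept))
                    if corr[kept[a]][kept[b]] >= threshold)
    return len(kept) - 2 * redundant


def select_attributes(corr, generations=40, pop_size=20,
                      mutation_rate=0.05, seed=0):
    """Small GA over bit masks (1 = keep the attribute); needs >= 2 attributes."""
    rng = random.Random(seed)
    k = len(corr)

    def score(mask):
        return fitness(mask, corr)

    # Start from the "keep everything" baseline plus random masks.
    population = [[1] * k] + [[rng.randint(0, 1) for _ in range(k)]
                              for _ in range(pop_size - 1)]
    for _ in range(generations):
        children = [max(population, key=score)[:]]  # elitism: keep the best
        while len(children) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            cut = rng.randint(1, k - 1)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=score)


# Toy data: b duplicates a (b = 2a + 1) and d duplicates c (d = -c).
a = [1, 2, 3, 4, 5, 6]
b = [2 * v + 1 for v in a]
c = [3, -1, -2, -2, -1, 3]
d = [-v for v in c]
corr = correlation_matrix([a, b, c, d])
best_mask = select_attributes(corr)
```

On this toy data the mask that scores highest under the assumed fitness keeps one attribute from each correlated pair. Because the search is stochastic, a run is not guaranteed to reach that optimum; elitism only ensures the result is never worse than the keep-everything baseline.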
Similar resources
A Multi-Mode Resource-Constrained Optimization of Time-Cost Trade-off Problems in Project Scheduling Using a Genetic Algorithm
In this paper, we present a genetic algorithm (GA) for optimization of a multi-mode resource-constrained time-cost trade-off (MRCTCT) problem. In the proposed GA, each activity has several operational modes, and each mode identifies a possible execution time and cost of the activity. Beyond earlier studies on the time-cost trade-off problem, in the MRCTCT problem, resource requirements of each execution mo...
A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset
Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of electrocardiogram signal classification and segmentation methodologies indicates that algorithms able to handle the nonstationarity and uncertainty of the signals effectively should be used for ECG analysis. Ex...
Application of genetic algorithm (GA) to select input variables in support vector machine (SVM) for analyzing the occurrence of roach, Rutilus rutilus, in streams
A support vector machine (SVM) was used to analyze the occurrence of roach in Flemish stream basins (Belgium). Several habitat and physico-chemical variables were used as inputs for model development. The biotic variable consisted merely of abundance data, which was used for predicting presence/absence of roach. A genetic algorithm (GA) was combined with the SVM in order to select the most important...
An Evolutionary Algorithm Based on a Hybrid Multi-Attribute Decision Making Method for the Multi-Mode Multi-Skilled Resource-constrained Project Scheduling Problem
This paper addresses the multi-mode multi-skilled resource-constrained project scheduling problem. Activities of real-world projects often require more than one skill to be accomplished. Besides, in many real-world situations, the resources are multi-skilled workforces. In the presence of multi-skilled resources, it is required to determine the combination of workforces assigned to each activity. H...
Comparative Study of Attribute Selection Using Gain Ratio and Correlation Based Feature Selection
Feature subset selection is of great importance in the field of data mining. High-dimensional data make testing and training of general classification methods difficult. In the present paper, two filter approaches, namely Gain ratio and Correlation based feature selection, have been used to illustrate the significance of feature subset selection for classifying Pima Indian diabetic database (P...